across()library("tidyverse")
library("skimr")
library("janitor")
library("palmerpenguins")
Recall that: group_by() and summarize() work great together.
penguins %>%
group_by(island) %>%
summarize(mean_body_mass_g=mean(body_mass_g, na.rm=T)) # remember to remove those NA's!
## # A tibble: 3 × 2
## island mean_body_mass_g
## <fct> <dbl>
## 1 Biscoe 4716.
## 2 Dream 3713.
## 3 Torgersen 3706.
What if we are interested in the number of observations (penguins) by species and island?
penguins %>%
group_by(island, species) %>%
summarize(n_penguins=n(), .groups = 'keep')
## # A tibble: 5 × 3
## # Groups: island, species [5]
## island species n_penguins
## <fct> <fct> <int>
## 1 Biscoe Adelie 44
## 2 Biscoe Gentoo 124
## 3 Dream Adelie 56
## 4 Dream Chinstrap 68
## 5 Torgersen Adelie 52
Recall that that count() works like a combination of
group_by() and summarize() but just shows the
number of observations.
penguins %>%
count(island, species)
## # A tibble: 5 × 3
## island species n
## <fct> <fct> <int>
## 1 Biscoe Adelie 44
## 2 Biscoe Gentoo 124
## 3 Dream Adelie 56
## 4 Dream Chinstrap 68
## 5 Torgersen Adelie 52
tabyl() will also produce counts, but the output is
different. It is just a matter of personal preference.
penguins %>%
tabyl(island, species) # tabyl is part of the janitor package
## island Adelie Chinstrap Gentoo
## Biscoe 44 0 124
## Dream 56 68 0
## Torgersen 52 0 0
across()Across allows us to usefilter() and
select() across multiple variables.
Here we are summarizing the mean of all variables that contain “mm”.
penguins %>%
summarize(across(contains("mm"), mean, na.rm=T))
## Warning: There was 1 warning in `summarize()`.
## ℹ In argument: `across(contains("mm"), mean, na.rm = T)`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
##
## # Previously
## across(a:b, mean, na.rm = TRUE)
##
## # Now
## across(a:b, \(x) mean(x, na.rm = TRUE))
## # A tibble: 1 × 3
## bill_length_mm bill_depth_mm flipper_length_mm
## <dbl> <dbl> <dbl>
## 1 43.9 17.2 201.
group_by also works.
penguins %>%
group_by(sex) %>%
summarize(across(contains("mm"), mean, na.rm=T))
## # A tibble: 3 × 4
## sex bill_length_mm bill_depth_mm flipper_length_mm
## <fct> <dbl> <dbl> <dbl>
## 1 female 42.1 16.4 197.
## 2 male 45.9 17.9 205.
## 3 <NA> 41.3 16.6 199
Here we summarize across all variables.
penguins %>%
summarise_all(n_distinct)
## # A tibble: 1 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <int> <int> <int> <int> <int> <int>
## 1 3 3 165 81 56 95
## # ℹ 2 more variables: sex <int>, year <int>
Operators can also work, here I am summarizing
n_distinct() across all variables except
species, island, and sex.
penguins %>%
summarize(across(!c(species, island, sex, year),
mean, na.rm=T))
## # A tibble: 1 × 4
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <dbl> <dbl> <dbl> <dbl>
## 1 43.9 17.2 201. 4202.
All variables that include “bill”…all of the other dplyr operators also work.
penguins %>%
summarise(across(starts_with("bill"), n_distinct))
## # A tibble: 1 × 2
## bill_length_mm bill_depth_mm
## <int> <int>
## 1 165 81
library("tidyverse")
library("naniar")
library("skimr")
library("janitor")
life_history <- read_csv("data_midterm2/mammal_lifehistories_v3.csv") %>% clean_names()
## Rows: 1440 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): order, family, Genus, species, newborn
## dbl (8): mass, gestation, weaning, wean mass, AFR, max. life, litter size, l...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(life_history)
## Rows: 1,440
## Columns: 13
## $ order <chr> "Artiodactyla", "Artiodactyla", "Artiodactyla", "Artiodac…
## $ family <chr> "Antilocapridae", "Bovidae", "Bovidae", "Bovidae", "Bovid…
## $ genus <chr> "Antilocapra", "Addax", "Aepyceros", "Alcelaphus", "Ammod…
## $ species <chr> "americana", "nasomaculatus", "melampus", "buselaphus", "…
## $ mass <dbl> 45375.0, 182375.0, 41480.0, 150000.0, 28500.0, 55500.0, 3…
## $ gestation <dbl> 8.13, 9.39, 6.35, 7.90, 6.80, 5.08, 5.72, 5.50, 8.93, 9.1…
## $ newborn <chr> "3246.36", "5480", "5093", "10166.67", "not measured", "3…
## $ weaning <dbl> 3.00, 6.50, 5.63, 6.50, -999.00, 4.00, 4.04, 2.13, 10.71,…
## $ wean_mass <dbl> 8900, -999, 15900, -999, -999, -999, -999, -999, 157500, …
## $ afr <dbl> 13.53, 27.27, 16.66, 23.02, -999.00, 14.89, 10.23, 20.13,…
## $ max_life <dbl> 142, 308, 213, 240, 0, 251, 228, 255, 300, 324, 300, 314,…
## $ litter_size <dbl> 1.85, 1.00, 1.00, 1.00, 1.00, 1.37, 1.00, 1.00, 1.00, 1.0…
## $ litters_year <dbl> 1.00, 0.99, 0.95, NA, NA, 2.00, NA, 1.89, 1.00, 1.00, 0.7…
summary(life_history)
## order family genus species
## Length:1440 Length:1440 Length:1440 Length:1440
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## mass gestation newborn weaning
## Min. : -999 Min. :-999.00 Length:1440 Min. :-999.00
## 1st Qu.: 50 1st Qu.:-999.00 Class :character 1st Qu.:-999.00
## Median : 403 Median : 1.05 Mode :character Median : 0.73
## Mean : 383577 Mean :-287.25 Mean :-427.17
## 3rd Qu.: 7009 3rd Qu.: 4.50 3rd Qu.: 2.00
## Max. :149000000 Max. : 21.46 Max. : 48.00
##
## wean_mass afr max_life litter_size
## Min. : -999 Min. :-999.00 Min. : 0.00 Min. :-999.000
## 1st Qu.: -999 1st Qu.:-999.00 1st Qu.: 0.00 1st Qu.: 1.000
## Median : -999 Median : 2.50 Median : 0.00 Median : 2.270
## Mean : 16049 Mean :-408.12 Mean : 93.19 Mean : -55.634
## 3rd Qu.: 10 3rd Qu.: 15.61 3rd Qu.: 147.25 3rd Qu.: 3.835
## Max. :19075000 Max. : 210.00 Max. :1368.00 Max. : 14.180
##
## litters_year
## Min. :0.140
## 1st Qu.:1.000
## Median :1.000
## Mean :1.636
## 3rd Qu.:2.000
## Max. :7.500
## NA's :689
New from the purrr package. Gives quick summary of
number of NA’s in each variable
life_history %>%
map_df(~ sum(is.na(.))) # It's telling you theres no NA because it read -999 as a number not NA
## # A tibble: 1 × 13
## order family genus species mass gestation newborn weaning wean_mass afr
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 0 0 0 0 0 0
## # ℹ 3 more variables: max_life <int>, litter_size <int>, litters_year <int>
naniar: package built to manage NA’sA single approach to deal with NA’s in this data set
life_history_no_nas <-
read_csv(file = "data_midterm2/mammal_lifehistories_v3.csv", na = c("NA", " ", ".", "-999")) %>% clean_names() # notice that I am creating a new object/ dataframe that doesn't have any NA's
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 1440 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): order, family, Genus, species, newborn
## dbl (8): mass, gestation, weaning, wean mass, AFR, max. life, litter size, l...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
miss_var_summary provides a clean summary of NA’s across
the data frame.
naniar::miss_var_summary(life_history_no_nas)
## # A tibble: 13 × 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 wean_mass 1039 72.2
## 2 litters_year 689 47.8
## 3 weaning 619 43.0
## 4 afr 607 42.2
## 5 gestation 418 29.0
## 6 mass 85 5.90
## 7 litter_size 84 5.83
## 8 order 0 0
## 9 family 0 0
## 10 genus 0 0
## 11 species 0 0
## 12 newborn 0 0
## 13 max_life 0 0
Notice that max_life has no NA’s. Does that make sense
in the context of this data? (no)
hist(life_history_no_nas$max_life) # we found another way NA's are represented
Let’s use mutate() and na_if() to replace
0’s with NA’s in max_life.
life_history_no_nas <-
life_history_no_nas %>%
mutate(max_life=na_if(max_life, 0))
miss_var_summary(life_history_no_nas)
## # A tibble: 13 × 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 wean_mass 1039 72.2
## 2 max_life 841 58.4
## 3 litters_year 689 47.8
## 4 weaning 619 43.0
## 5 afr 607 42.2
## 6 gestation 418 29.0
## 7 mass 85 5.90
## 8 litter_size 84 5.83
## 9 order 0 0
## 10 family 0 0
## 11 genus 0 0
## 12 species 0 0
## 13 newborn 0 0
We can also use miss_var_summary with
group_by(). This helps us better evaluate where NA’s are in
the data.
life_history_no_nas %>%
group_by(order) %>%
select(order) %>%
miss_var_summary(order=T)
## # A tibble: 0 × 3
## # Groups: order [0]
## # ℹ 3 variables: order <chr>, n_miss <int>, pct_miss <dbl>
naniar also has a nice replace function which will allow
you to precisely control which values you want replaced with NA’s in
each variable.
life_history %>% # going back to original data
replace_with_na(replace = list(newborn = "not measured",
weaning= -999,
wean_mass= -999,
afr= -999,
max_life= 0,
litter_size= -999,
gestation= -999,
mass= -999)) %>%
miss_var_summary()
## # A tibble: 13 × 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 wean_mass 1039 72.2
## 2 max_life 841 58.4
## 3 litters_year 689 47.8
## 4 weaning 619 43.0
## 5 afr 607 42.2
## 6 newborn 595 41.3
## 7 gestation 418 29.0
## 8 mass 85 5.90
## 9 litter_size 84 5.83
## 10 order 0 0
## 11 family 0 0
## 12 genus 0 0
## 13 species 0 0
# Makes replacement of NA's very specific
You can also use naniar to replace a specific value (like -999) with NA across the entire data set.
life_history %>% #going back to the original data
replace_with_na_all(condition = ~.x == -999)%>%
miss_var_summary()
## # A tibble: 13 × 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 wean_mass 1039 72.2
## 2 litters_year 689 47.8
## 3 weaning 619 43.0
## 4 afr 607 42.2
## 5 gestation 418 29.0
## 6 mass 85 5.90
## 7 litter_size 84 5.83
## 8 order 0 0
## 9 family 0 0
## 10 genus 0 0
## 11 species 0 0
## 12 newborn 0 0
## 13 max_life 0 0
Finally, naniar has some built-in examples of common values or character strings used to represent NA’s. The chunk below will use these built-in parameters to replace NA’s across the entire data set.
common_na_strings
## [1] "missing" "NA" "N A" "N/A" "#N/A" "NA " " NA"
## [8] "N /A" "N / A" " N / A" "N / A " "na" "n a" "n/a"
## [15] "na " " na" "n /a" "n / a" " a / a" "n / a " "NULL"
## [22] "null" "" "\\?" "\\*" "\\."
common_na_numbers
## [1] -9 -99 -999 -9999 9999 66 77 88
life_history %>% #going back to the original data
replace_with_na_all(condition = ~.x %in% c(common_na_strings, common_na_numbers)) %>%
mutate(newborn=na_if(newborn, "not measured"))
## # A tibble: 1,440 × 13
## order family genus species mass gestation newborn weaning wean_mass afr
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 Artiod… Antil… Anti… americ… 4.54e4 8.13 3246.36 3 8900 13.5
## 2 Artiod… Bovid… Addax nasoma… 1.82e5 9.39 5480 6.5 NA 27.3
## 3 Artiod… Bovid… Aepy… melamp… 4.15e4 6.35 5093 5.63 15900 16.7
## 4 Artiod… Bovid… Alce… busela… 1.5 e5 7.9 10166.… 6.5 NA 23.0
## 5 Artiod… Bovid… Ammo… clarkei 2.85e4 6.8 <NA> NA NA NA
## 6 Artiod… Bovid… Ammo… lervia 5.55e4 5.08 3810 4 NA 14.9
## 7 Artiod… Bovid… Anti… marsup… 3 e4 5.72 3910 4.04 NA 10.2
## 8 Artiod… Bovid… Anti… cervic… 3.75e4 5.5 3846 2.13 NA 20.1
## 9 Artiod… Bovid… Bison bison 4.98e5 8.93 20000 10.7 157500 29.4
## 10 Artiod… Bovid… Bison bonasus 5 e5 9.14 23000.… 6.6 NA 30.0
## # ℹ 1,430 more rows
## # ℹ 3 more variables: max_life <dbl>, litter_size <dbl>, litters_year <dbl>
There is another package visdat that can be used to
visualize the proportion of different classes of data, including missing
data. But, it is limited by size.
library(visdat)
vis_dat(life_history) #classes of data
vis_miss(life_history)
If you are sure that you know how NA’s are treated in the data, then
you can deal with them in advance using na() as part of the
readr package.
life_history_advance <-
readr::read_csv(file = "data_midterm2/mammal_lifehistories_v3.csv",
na = c("NA", " ", ".", "-999")) #all NA, blank spaces, .,and -999 are treated as NA
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 1440 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): order, family, Genus, species, newborn
## dbl (8): mass, gestation, weaning, wean mass, AFR, max. life, litter size, l...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
miss_var_summary(life_history_advance)
## # A tibble: 13 × 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 wean mass 1039 72.2
## 2 litters/year 689 47.8
## 3 weaning 619 43.0
## 4 AFR 607 42.2
## 5 gestation 418 29.0
## 6 mass 85 5.90
## 7 litter size 84 5.83
## 8 order 0 0
## 9 family 0 0
## 10 Genus 0 0
## 11 species 0 0
## 12 newborn 0 0
## 13 max. life 0 0
pivot_long()library(tidyverse)
Tidy data follows three conventions:
(1) each variable has its own column
(2) each observation has its own row
(3) each value has its own cell
pivot_longer(): shifts data from wide (data entry in
spreadsheets, column names may represent values of variable) to long
format.Rules:
+ pivot_longer(cols, names_to, values_to) +
cols - Columns to pivot to longer format +
names_to - Name of the new column; it will contain the
column names of gathered columns as values + values_to -
Name of the new column; it will contain the data stored in the values of
gathered columns
The following data show results from a drug trial with four treatments on six patients. The values represent resting heart rate.
heartrate <- read_csv("data_midterm2/heartrate.csv")
## Rows: 6 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): patient
## dbl (4): a, b, c, d
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
heartrate
## # A tibble: 6 × 5
## patient a b c d
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Margaret 72 74 80 68
## 2 Frank 84 84 88 76
## 3 Hawkeye 64 66 68 64
## 4 Trapper 60 58 64 58
## 5 Radar 74 72 78 70
## 6 Henry 88 87 88 72
To fix untidy data, we need to reshape the table to long format while
keeping track of column names and values. We do this using
pivot_longer(). Notice that the dimensions of the data
frame change.
heartrate %>%
pivot_longer(-patient, #patient will not move
names_to = "drug", #make a new column called "drug"
values_to="heartrate" #values moved to a new column called "heartrate"
)
## # A tibble: 24 × 3
## patient drug heartrate
## <chr> <chr> <dbl>
## 1 Margaret a 72
## 2 Margaret b 74
## 3 Margaret c 80
## 4 Margaret d 68
## 5 Frank a 84
## 6 Frank b 84
## 7 Frank c 88
## 8 Frank d 76
## 9 Hawkeye a 64
## 10 Hawkeye b 66
## # ℹ 14 more rows
Some (but not all) of the column names are data. We also have NA’s.
billboard <- read_csv("data_midterm2/billboard.csv")
## Rows: 317 Columns: 79
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): artist, track
## dbl (65): wk1, wk2, wk3, wk4, wk5, wk6, wk7, wk8, wk9, wk10, wk11, wk12, wk...
## lgl (11): wk66, wk67, wk68, wk69, wk70, wk71, wk72, wk73, wk74, wk75, wk76
## date (1): date.entered
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
billboard
## # A tibble: 317 × 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
## 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
## 3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
## 4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59
## 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
## 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
## 7 A*Teens Danc… 2000-07-08 97 97 96 95 100 NA NA NA
## 8 Aaliyah I Do… 2000-01-29 84 62 51 41 38 35 35 38
## 9 Aaliyah Try … 2000-03-18 59 53 38 28 21 18 16 14
## 10 Adams, Yo… Open… 2000-08-26 76 76 74 69 68 67 61 58
## # ℹ 307 more rows
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …
Solution 1: specify a range of columns that you want to pivot.
billboard2 <-
billboard %>%
pivot_longer(wk1:wk76, # a range of columns
names_to = "week",
values_to = "rank",
values_drop_na = TRUE #this will drop the NA's
)
billboard2
## # A tibble: 5,307 × 5
## artist track date.entered week rank
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 wk2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 wk3 92
## # ℹ 5,297 more rows
Solution 2: OR, specify columns that you want to stay fixed.
billboard3 <-
billboard %>%
pivot_longer(-c(artist, track, date.entered), #specific columns to stay fixed (by using c, meaning dont move these)
names_to = "week",
values_to = "rank",
values_drop_na = TRUE # this will drop the NA's
)
billboard3
## # A tibble: 5,307 × 5
## artist track date.entered week rank
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 wk2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 wk3 92
## # ℹ 5,297 more rows
Solution 3: identify columns by a prefix, remove the prefix and all NA’s.
billboard %>%
pivot_longer(
cols = starts_with("wk"), #columns that start with "wk"
names_to = "week",
names_prefix = "wk",
values_to = "rank",
values_drop_na = TRUE)
## # A tibble: 5,307 × 5
## artist track date.entered week rank
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 3 92
## # ℹ 5,297 more rows
In this case, there are two observations in each column name.
qpcr_untidy <- read_csv("data_midterm2/qpcr_untidy.csv")
## Rows: 5 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gene
## dbl (9): exp1_rep1, exp1_rep2, exp1_rep3, exp2_rep1, exp2_rep2, exp2_rep3, e...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
qpcr_untidy
## # A tibble: 5 × 10
## gene exp1_rep1 exp1_rep2 exp1_rep3 exp2_rep1 exp2_rep2 exp2_rep3 exp3_rep1
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A 21.7 19.8 20.7 18.3 20.4 17.6 20.6
## 2 B 24.3 24.8 25.2 26.0 29.9 26.4 25.4
## 3 C 20.7 21.5 21.3 25.5 18.7 22.3 21.9
## 4 D 26.9 28.0 27.7 33.1 24.3 28.9 28.5
## 5 E 25.0 22.7 23.8 21.1 23.4 20.2 23.7
## # ℹ 2 more variables: exp3_rep2 <dbl>, exp3_rep3 <dbl>
names_sep helps pull these apart, but we still have
“exp” and “rep” to deal with.
qpcr_untidy %>%
pivot_longer(
exp1_rep1:exp3_rep3,
names_to= c("experiment", "replicate"),
names_sep="_", #names seperate by underscore
values_to="mRNA_expression"
)
## # A tibble: 45 × 4
## gene experiment replicate mRNA_expression
## <chr> <chr> <chr> <dbl>
## 1 A exp1 rep1 21.7
## 2 A exp1 rep2 19.8
## 3 A exp1 rep3 20.7
## 4 A exp2 rep1 18.3
## 5 A exp2 rep2 20.4
## 6 A exp2 rep3 17.6
## 7 A exp3 rep1 20.6
## 8 A exp3 rep2 19.9
## 9 A exp3 rep3 19.2
## 10 B exp1 rep1 24.3
## # ℹ 35 more rows
pivot_wider()pivot_longer()Recall that we use pivot_longer() when our column names
actually represent variables. A classic example would be that the column
names represent observations of a variable.
datasets::USPersonalExpenditure
## 1940 1945 1950 1955 1960
## Food and Tobacco 22.200 44.500 59.60 73.2 86.80
## Household Operation 10.500 15.500 29.00 36.5 46.20
## Medical and Health 3.530 5.760 9.71 14.0 21.10
## Personal Care 1.040 1.980 2.45 3.4 5.40
## Private Education 0.341 0.974 1.80 2.6 3.64
?USPersonalExpenditure
Here we add a new column of expenditure types, which are stored as
rownames above, with mutate(). The
USPersonalExpenditures data also needs to be converted to a
data frame before we can use the tidyverse functions, because it comes
as a matrix.
expenditures <- USPersonalExpenditure %>%
as_tibble() %>% #this transforms the matrix into a data frame
mutate(expenditure = rownames(USPersonalExpenditure))
expenditures
## # A tibble: 5 × 6
## `1940` `1945` `1950` `1955` `1960` expenditure
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 22.2 44.5 59.6 73.2 86.8 Food and Tobacco
## 2 10.5 15.5 29 36.5 46.2 Household Operation
## 3 3.53 5.76 9.71 14 21.1 Medical and Health
## 4 1.04 1.98 2.45 3.4 5.4 Personal Care
## 5 0.341 0.974 1.8 2.6 3.64 Private Education
separate()In this new heart rate example, we have the sex of each patient included with their name. Are these data tidy? No, there is more than one value per cell in the patient column and the columns a, b, c, d once again represent values.
heartrate2 <- read_csv("data_midterm2/heartrate2.csv")
## Rows: 6 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): patient
## dbl (4): a, b, c, d
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
heartrate2
## # A tibble: 6 × 5
## patient a b c d
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Margaret_f 72 74 80 68
## 2 Frank_m 84 84 88 76
## 3 Hawkeye_m 64 66 68 64
## 4 Trapper_m 60 58 64 58
## 5 Radar_m 74 72 78 70
## 6 Henry_m 88 87 88 72
We need to start by separating the patient names from their sexes.
separate() needs to know which column you want to split,
the names of the new columns, and what to look for in terms of breaks in
the data.
heartrate2 %>%
separate(patient, into= c("patient", "sex"), sep = "_")
## # A tibble: 6 × 6
## patient sex a b c d
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Margaret f 72 74 80 68
## 2 Frank m 84 84 88 76
## 3 Hawkeye m 64 66 68 64
## 4 Trapper m 60 58 64 58
## 5 Radar m 74 72 78 70
## 6 Henry m 88 87 88 72
Re-examine heartrate2. Use separate() for
the sexes, pivot_longer() to tidy, and
arrange() to organize by patient and drug. Store this as a
new object heartrate3.
heartrate3 <- heartrate2 %>%
separate(patient, into=c("patient", "sex"), sep="_") %>%
pivot_longer(-c(patient, sex),
names_to = "drug",
values_to = "heartrate")
heartrate3
## # A tibble: 24 × 4
## patient sex drug heartrate
## <chr> <chr> <chr> <dbl>
## 1 Margaret f a 72
## 2 Margaret f b 74
## 3 Margaret f c 80
## 4 Margaret f d 68
## 5 Frank m a 84
## 6 Frank m b 84
## 7 Frank m c 88
## 8 Frank m d 76
## 9 Hawkeye m a 64
## 10 Hawkeye m b 66
## # ℹ 14 more rows
unite() is the opposite of separate(). Its syntax is
straightforward. You only need to give a new column name and then list
the columns to combine with a separation character. Give it a try below
by recombining patient and sex from heartrate3.
heartrate3 %>%
unite(patient_sex, "patient", "sex", sep=" ") #puts sexes back in name
## # A tibble: 24 × 3
## patient_sex drug heartrate
## <chr> <chr> <dbl>
## 1 Margaret f a 72
## 2 Margaret f b 74
## 3 Margaret f c 80
## 4 Margaret f d 68
## 5 Frank m a 84
## 6 Frank m b 84
## 7 Frank m c 88
## 8 Frank m d 76
## 9 Hawkeye m a 64
## 10 Hawkeye m b 66
## # ℹ 14 more rows
pivot_wider()The opposite of pivot_longer(). You use
pivot_wider() when you have an observation scattered across
multiple rows. In the example below, cases and
population represent variable names not observations.
Rules:
+ pivot_wider(names_from, values_from)
+ names_from - Values in the names_from column
will become new column names
+ values_from - Cell values will be taken from the
values_from column
tb_data <- read_csv("data_midterm2/tb_data.csv")
## Rows: 12 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country, key
## dbl (2): year, value
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tb_data
## # A tibble: 12 × 4
## country year key value
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
When using pivot_wider() we use names_from
to identify the variables (new column names) and
values_from to identify the values associated with the new
columns.
tb_data_wide <- tb_data %>%
pivot_wider(names_from = "key", #the observations under key will become new columns
values_from = "value") #the values under value will be moved to the new columns
tb_data_wide
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
pivot_longer to transform them back to long!
tb_data_long <- tb_data_wide %>%
pivot_longer(c(-country, - year),
names_to = "key",
values_to = "value")
tb_data_long
## # A tibble: 12 × 4
## country year key value
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
ggplot part 1We first need to define some of the data types we will use to build plots.
discrete quantitative data that only contains
integerscontinuous quantitative data that can take any
numerical valuecategorical qualitative data that can take on a limited
number of valuesThe syntax used by ggplot takes some practice to get used to, especially for customizing plots, but the basic elements are the same. It is helpful to think of plots as being built up in layers.
In short, plot= data + geom_ + aesthetics.
To make things easy, let’s start with some built in data.
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
To make a plot, we need to first specify the data and map the aesthetics. The aesthetics include how each variable in our data set will be used. In the example below, I am using the aes() function to identify the x and y variables in the plot.
ggplot(data=iris, #specify the data
mapping=aes(x=Species, y=Petal.Length)) #map the aesthetics
# species is a factor on x-axis
Here we specify that we want a boxplot, indicated by
geom_boxplot().
ggplot(data=iris, #specify the data
mapping=aes(x=Species, y=Petal.Length))+ #map the aesthetics
geom_boxplot() #add the plot type
Now that we have a general idea of the syntax, let’s start by working with two common plots: 1) scatter plots and 2) bar plots.
homerange <- read_csv("data_midterm2/Tamburelloetal_HomeRangeDatabase.csv")
## Rows: 569 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): taxon, common.name, class, order, family, genus, species, primarym...
## dbl (8): mean.mass.g, log10.mass, mean.hra.m2, log10.hra, dimension, preyma...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Scatter plots are good at revealing relationships that are not readily visible in the raw data. For now, we will not add regression aka. “best of fit” lines or calculate any r2 values.
In the case below, we are exploring whether or not there is a relationship between animal mass and home range. We are using the log transformed values because there is a large difference in mass and home range among the different species in the data.
names(homerange)
## [1] "taxon" "common.name"
## [3] "class" "order"
## [5] "family" "genus"
## [7] "species" "primarymethod"
## [9] "N" "mean.mass.g"
## [11] "log10.mass" "alternative.mass.reference"
## [13] "mean.hra.m2" "log10.hra"
## [15] "hra.reference" "realm"
## [17] "thermoregulation" "locomotion"
## [19] "trophic.guild" "dimension"
## [21] "preymass" "log10.preymass"
## [23] "PPMR" "prey.size.reference"
ggplot(data=homerange, #specify the data
mapping=aes(x=log10.mass, y=log10.hra))+ #map the aesthetics
geom_point() #add the plot type
In big data sets with lots of overlapping values, over-plotting can
be an issue. geom_jitter() is similar to
geom_point() but it helps with over plotting by adding some
random noise to the data and separating some of the individual
points.
ggplot(data=homerange, mapping=aes(x=log10.mass, y=log10.hra))+
geom_jitter() #adds some random noise to separate the points a little
To add a regression (best of fit) line, we just add another layer.
ggplot(data=homerange, mapping=aes(x=log10.mass, y=log10.hra))+
geom_point()+
geom_smooth(method=lm, se=T) #add a regression line (linear, standard error)
## `geom_smooth()` using formula = 'y ~ x'
geom_bar()The simplest type of bar plot counts the number of observations in a
categorical variable. In this case, we want to know how many
observations are present in the variable trophic.guild.
Notice that we do not specify a y-axis because it is count by
default.
names(homerange)
## [1] "taxon" "common.name"
## [3] "class" "order"
## [5] "family" "genus"
## [7] "species" "primarymethod"
## [9] "N" "mean.mass.g"
## [11] "log10.mass" "alternative.mass.reference"
## [13] "mean.hra.m2" "log10.hra"
## [15] "hra.reference" "realm"
## [17] "thermoregulation" "locomotion"
## [19] "trophic.guild" "dimension"
## [21] "preymass" "log10.preymass"
## [23] "PPMR" "prey.size.reference"
homerange %>%
count(trophic.guild)
## # A tibble: 2 × 2
## trophic.guild n
## <chr> <int>
## 1 carnivore 342
## 2 herbivore 227
Also notice that we can use pipes! The mapping= function
is implied by aes and so is often left out.
homerange %>%
ggplot(aes(x=trophic.guild)) + #only works with categorical data
geom_bar() #good for counts
geom_col()Unlike geom_bar(), geom_col() allows us to
specify an x-axis and a y-axis.
homerange %>%
filter(family=="salmonidae") %>%
select(common.name, log10.mass) %>%
ggplot(aes(y=common.name, x=log10.mass))+ #notice the switch in x and y
geom_col()+
coord_flip() #flips the graph
geom_bar() with stat="identity"
stat="identity" allows us to map a variable to the y-axis
so that we aren’t restricted to counts.
homerange %>%
filter(family=="salmonidae") %>%
ggplot(aes(x=common.name, y=log10.mass))+
geom_bar(stat="identity") # need to specify this
ggplot part 1For the next series of examples, we will use the
homerange data. Database of vertebrate home range
sizes.
Boxplots help us visualize a range of values. So, on the x-axis we
typically have something categorical and the y-axis is the range. In the
case below, we are plotting log10.mass by taxonomic class
in the homerange data. geom_boxplot() is the
geom type for a standard box plot. The center line in each box
represents the median, not the mean.
Let’s look at the variable log10.mass grouped by
taxonomic class.
homerange %>%
group_by(class) %>%
summarize(min_log10.mass=min(log10.mass),
max_log10.mass=max(log10.mass),
median_log10.mass=median(log10.mass))
## # A tibble: 4 × 4
## class min_log10.mass max_log10.mass median_log10.mass
## <chr> <dbl> <dbl> <dbl>
## 1 actinopterygii -0.658 3.55 2.08
## 2 aves 0.712 4.95 1.82
## 3 mammalia 0.620 6.60 3.33
## 4 reptilia 0.539 4.03 2.51
homerange %>%
ggplot(aes(x = class, y = log10.mass)) +
geom_boxplot()
ggplot part 2Let’s revisit the mammal life history data to practice our ggplot skills. The data are from: S. K. Morgan Ernest. 2003. Life history characteristics of placental non-volant mammals. Ecology 84:3402.
life_history <- read_csv("data_midterm2/mammal_lifehistories_v2.csv", na="-999") %>% clean_names()
Bar plots count the number of observations in a categorical variable.
What is the difference between geom_bar and
geom_col? Make two bar plots showing the number of
observations for each order using each geom type.
names(life_history)
## [1] "order" "family" "genus" "species" "mass"
## [6] "gestation" "newborn" "weaning" "wean_mass" "afr"
## [11] "max_life" "litter_size" "litters_year"
geom_col
life_history %>%
count(order, sort=T) %>% # This is the same as arrange
ggplot(aes(x=order, y=n))+
geom_col()+
coord_flip()
geom_bar
life_history %>%
ggplot(aes(x=order))+
geom_bar()+
coord_flip()
Remember that ggplot build plots in layers. These layers can
significantly improve the appearance of the plot. What if we wanted a
bar plot of the mean mass for each order? Would we use
geom_bar or geom_col? (geom_col)
life_history %>%
group_by(order) %>%
summarize(mean_mass = mean(mass, na.rm=T)) %>%
ggplot(aes(x=order, y=mean_mass))+
geom_col()+
coord_flip()
There are a few problems here. First, the y-axis is in scientific notation. We can fix this by adjusting the options for the session.
options(scipen=999)#cancels scientific notation for the session
Next, the y-axis is not on a log scale. We can fix this by adding
scale_y_log10().
life_history %>%
group_by(order) %>%
summarize(mean_mass = mean(mass, na.rm=T)) %>%
ggplot(aes(x=order, y=mean_mass))+
geom_col()+
coord_flip()+
scale_y_log10()
Lastly, we can adjust the x-axis labels to make them more readable.
We do this using reorder.
life_history %>%
group_by(order) %>%
summarize(mean_mass = mean(mass, na.rm=T)) %>%
ggplot(aes(x=reorder(order, mean_mass), y=mean_mass))+ # This allows us to order the x-axis by mean_mass
geom_col()+
coord_flip()+
scale_y_log10()
Scatter plots allow for comparisons of two continuous variables. Make a scatterplot below that compares gestation time and weaning mass.
life_history %>%
ggplot(aes(x=gestation, y=wean_mass))+
geom_jitter(na.rm = T)+ # prevents overplotting
scale_y_log10()
Box plots help us visualize a range of values. So, on the x-axis we typically have something categorical and the y-axis is the range. Let’s make a box plot that compares mass across taxonomic orders.
life_history %>%
ggplot(aes(x=order, y=mass))+
geom_boxplot(na.rm = T)+
coord_flip()+
scale_y_log10()
Now that we have practiced scatter plots, bar plots, and box plots we need to learn how to adjust their appearance to suit our needs. Let’s start with labeling x and y axes.
elephants <- read_csv("data_midterm2/elephantsMF.csv") %>% clean_names()
## Rows: 288 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Sex
## dbl (2): Age, Height
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Make a plot that compares age and height of elephants.
elephants %>%
ggplot(aes(x=age, y=height))+
geom_point()+
geom_smooth(method = lm, sex = F)
## Warning in geom_smooth(method = lm, sex = F): Ignoring unknown parameters:
## `sex`
## `geom_smooth()` using formula = 'y ~ x'
The plot looks clean, but it is incomplete. A reader unfamiliar with
the data might have a difficult time interpreting the labels. To add
custom labels, we use the labs command.
elephants %>%
ggplot(aes(x=age, y=height)) +
geom_point() +
geom_smooth(method=lm, se=F)+
labs(title="Elephant Age vs. Height", # adds a title
x="Age (years)",
y="Height (cm)")
## `geom_smooth()` using formula = 'y ~ x'
We can improve the plot further by adjusting the size and face of the
text. We do this using theme(). The rel()
option changes the relative size of the title to keep things consistent.
Adding hjust allows control of title position.
elephants %>%
ggplot(aes(x=age, y=height)) +
geom_point() +
geom_smooth(method=lm, se=F)+ # se = standard error along with line
labs(title="Elephant Age vs. Height", # adds a title
x="Age (years)",
y="Height (cm)")+
theme(plot.title=element_text(size=rel(1.5), hjust = 0.5)) # rel = title relative to the plot, hjust = horizontal (left = 0, right = 1)
## `geom_smooth()` using formula = 'y ~ x'
There are lots of options for aesthetics. An aesthetic can be
assigned to either numeric or categorical data. fill is a
common grouping option; notice that an appropriate key is displayed when
you use one of these options.
elephants %>%
ggplot(aes(x=sex, fill=sex))+ #fill is a grouping option
geom_bar()
size adjusts the size of points relative to a continuous
variable.
life_history %>%
ggplot(aes(x=gestation, y=log10(mass), size=mass))+ # Made size relative to mass of thing plotted
geom_point(na.rm=T)
ggplot part 2homerange <-
read_csv("data_midterm2/Tamburelloetal_HomeRangeDatabase.csv", na = c("", "NA", "\\"))
## Rows: 569 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): taxon, common.name, class, order, family, genus, species, primarym...
## dbl (8): mean.mass.g, log10.mass, mean.hra.m2, log10.hra, dimension, preyma...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
There are many options to create nice plots in ggplot. One useful
trick is to store the plot as a new object and then experiment with
geom’s and aesthetics. Let’s setup a plot that compares
log10.mass and log10.hra. Notice that we are
not specifying a geom.
p <- homerange %>%
ggplot(aes(x= log10.mass, y= log10.hra)) # store base plot as object
Play with point size by adjusting the size argument.
p + geom_point(size=1.25)
We can color the points by a categorical variable.
p + geom_point(aes(color=thermoregulation), size=1.5) # For scatterplots, "color" is the same as "fill" in barplots
We can also map shapes to another categorical variable.
p + geom_point(aes(color=thermoregulation, shape=thermoregulation), size=1.5)
At this point you should be comfortable building bar plots that show
counts of observations using geom_bar(). Last time we
explored the fill option as a way to bring color to the
plot; i.e. we filled by the same variable that we were plotting. What
happens when we fill by a different categorical variable?
Let’s start by counting how many observations we have in each taxonomic
group.
homerange %>% count(taxon, sort=T)
## # A tibble: 9 × 2
## taxon n
## <chr> <int>
## 1 mammals 238
## 2 birds 140
## 3 marine fishes 90
## 4 snakes 41
## 5 river fishes 14
## 6 turtles 14
## 7 tortoises 12
## 8 lizards 11
## 9 lake fishes 9
Now let’s make a bar plot of these data.
homerange %>%
ggplot(aes(x=taxon))+
geom_bar()+
coord_flip()+
labs(title="Observations by Taxon",
x="Taxonomic Group")
By specifying fill=trophic.guild we build a stacked bar
plot that shows the proportion of a given taxonomic group that is an
herbivore or carnivore.
homerange %>%
ggplot(aes(x=taxon, fill=trophic.guild))+
geom_bar()+
coord_flip()+
labs(title="Observations by Taxon",
x="Taxonomic Group")
We can also have counts of each trophic guild within taxonomic group
shown side-by-side by specifying position="dodge".
homerange %>%
ggplot(aes(x=taxon, fill=trophic.guild))+
geom_bar(position="dodge")+
coord_flip()+
labs(title="Observations by Taxon",
x="Taxonomic Group")
Here is the same plot oriented vertically.
homerange %>%
ggplot(aes(x=taxon, fill=trophic.guild))+
geom_bar(position="dodge")+
theme(axis.text.x=element_text(angle=60, hjust = 1))+ #change angle of x-axis observations
labs(title="Observations by Taxon",
x="Taxonomic Group")
We can also scale all bars to a percentage.
homerange %>%
ggplot(aes(x = taxon, fill = trophic.guild))+
geom_bar(position = position_fill())+
scale_y_continuous(labels = scales::percent)+
coord_flip()
groupIn addition to fill, group is an aesthetic
that accomplishes the same function but does not add color.
Here is a box plot that shows log10.mass by taxonomic
class.
homerange %>%
ggplot(aes(x = class, y = log10.mass)) +
geom_boxplot()
I use group to make individual box plots for each taxon
within class.
homerange %>%
ggplot(aes(x = class, y = log10.mass, group = taxon)) +
geom_boxplot()
I can also use fill to associate the different taxa with
a color coded key.
homerange %>%
ggplot(aes(x = class, y = log10.mass, fill = taxon)) +
geom_boxplot(alpha=0.5)
ggplot part 3library(tidyverse)
library(RColorBrewer)
library(paletteer)
library(janitor)
options(scipen=999) #cancels the use of scientific notation for the session
For this tutorial, we will use two data sets: deserts and homerange
deserts <- read_csv("data_midterm2/surveys_complete.csv")
Line plots are great when you need to show changes over time. Here we
look at the number of samples for species DM and DS over the years
represented in the deserts data. This takes some careful
thought- we want to know how sampling has changed over time for these
two species.
Let’s start by making a clear x and y so we know what we are going to plot.
deserts %>%
filter(species_id=="DM" | species_id == "DS") %>%
mutate(year=as.factor(year)) %>% # Change to factor bc R would think you're doing a numeric (addition etc.) but you dont add years
group_by(year, species_id) %>%
summarize(n=n(), .groups = 'keep') %>%
pivot_wider(names_from = species_id, values_from = n)
## # A tibble: 26 × 3
## # Groups: year [26]
## year DM DS
## <fct> <int> <int>
## 1 1977 264 98
## 2 1978 389 320
## 3 1979 209 204
## 4 1980 493 346
## 5 1981 559 354
## 6 1982 609 354
## 7 1983 528 280
## 8 1984 396 76
## 9 1985 667 98
## 10 1986 406 88
## # ℹ 16 more rows
deserts %>%
filter(species_id=="DM" | species_id == "DS") %>%
mutate(year=as.factor(year)) %>% # If you remove this gaps would be big and treated as numeric
group_by(year, species_id) %>%
summarize(n=n(), .groups = 'keep') %>%
ggplot(aes(x=year, y=n, group=species_id, color = species_id))+
geom_line()+
geom_point(shape=2)+ # You can experiment with shapes
theme(axis.text.x=element_text(angle=60, hjust=1))+
labs(title="Number of Samples for Species DM & DS",
x = "Year",
y = "n")
Histograms are frequently used by biologists; they show the
distribution of continuous variables. As students, you have seen
histograms of grade distributions. A histogram bins the
data and you specify the number of bins that encompass a range of
observations. For something like grades, this is easy because the number
of bins corresponds to the grades A-F. By default, R uses a formula to
calculate the number of bins but some adjustment may be required.
What does the distribution of body mass look like in the
homerange data?
homerange %>%
ggplot(aes(x = log10.mass)) +
geom_histogram(bins = 30)+ #we can adjust the number of bins with the bins argument
labs(title = "Distribution of Body Mass")
Let’s play with the colors. This shows all 657 of R’s built-in
colors. Notice that color and fill do
different things!
#grDevices::colors()
Let’s rebuild the histogram, but this time we will specify the color and fill. Do a little experimentation on your own with the different colors.
homerange %>%
ggplot(aes(x = log10.mass)) +
geom_histogram(color = "black", fill = "powderblue", bins=10)+
labs(title = "Distribution of Body Mass")
Density plots are similar to histograms but they use a smoothing function to make the distribution more even and clean looking. They do not use bins.
homerange %>%
ggplot(aes(x = log10.mass)) +
geom_density(fill="deepskyblue4", alpha =0.4, color = "black")+ #alpha is the transparency
labs(title = "Distribution of Body Mass")
I like to see both the histogram and the density curve so I often plot them together. Note that I assign the density plot a different color.
homerange %>%
ggplot(aes(x=log10.mass)) +
geom_histogram(aes(y = after_stat(density)), fill = "deepskyblue4", alpha = 0.4, color = "black")+
geom_density(color = "red")+
labs(title = "Distribution of Body Mass")
case_when() is a very handy function from
dplyr which allows us to calculate a new variable from
other variables. We use case_when() within
mutate() to do this.case_when() allows us to
specify multiple conditions. Let’s reclassify the body mass variable
into a new factor variable with small, medium, and large animals. In
this case, we are making a continuous variable into a categorical
variable.
glimpse(homerange)
## Rows: 569
## Columns: 24
## $ taxon <chr> "lake fishes", "river fishes", "river fishe…
## $ common.name <chr> "american eel", "blacktail redhorse", "cent…
## $ class <chr> "actinopterygii", "actinopterygii", "actino…
## $ order <chr> "anguilliformes", "cypriniformes", "cyprini…
## $ family <chr> "anguillidae", "catostomidae", "cyprinidae"…
## $ genus <chr> "anguilla", "moxostoma", "campostoma", "cli…
## $ species <chr> "rostrata", "poecilura", "anomalum", "fundu…
## $ primarymethod <chr> "telemetry", "mark-recapture", "mark-recapt…
## $ N <chr> "16", NA, "20", "26", "17", "5", "2", "2", …
## $ mean.mass.g <dbl> 887.00, 562.00, 34.00, 4.00, 4.00, 3525.00,…
## $ log10.mass <dbl> 2.9479236, 2.7497363, 1.5314789, 0.6020600,…
## $ alternative.mass.reference <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ mean.hra.m2 <dbl> 282750.00, 282.10, 116.11, 125.50, 87.10, 3…
## $ log10.hra <dbl> 5.4514026, 2.4504031, 2.0648696, 2.0986437,…
## $ hra.reference <chr> "Minns, C. K. 1995. Allometry of home range…
## $ realm <chr> "aquatic", "aquatic", "aquatic", "aquatic",…
## $ thermoregulation <chr> "ectotherm", "ectotherm", "ectotherm", "ect…
## $ locomotion <chr> "swimming", "swimming", "swimming", "swimmi…
## $ trophic.guild <chr> "carnivore", "carnivore", "carnivore", "car…
## $ dimension <dbl> 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3…
## $ preymass <dbl> NA, NA, NA, NA, NA, NA, 1.39, NA, NA, NA, N…
## $ log10.preymass <dbl> NA, NA, NA, NA, NA, NA, 0.1430148, NA, NA, …
## $ PPMR <dbl> NA, NA, NA, NA, NA, NA, 530, NA, NA, NA, NA…
## $ prey.size.reference <chr> NA, NA, NA, NA, NA, NA, "Brose U, et al. 20…
homerange %>%
select(log10.mass) %>%
summarize(min=min(log10.mass),
max=max(log10.mass))
## # A tibble: 1 × 2
## min max
## <dbl> <dbl>
## 1 -0.658 6.60
summary(homerange$log10.mass)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.6576 1.6990 2.5185 2.5947 3.3324 6.6021
homerange %>%
mutate(mass_category=case_when(log10.mass<=1.75 ~ "small",
log10.mass>1.75 & log10.mass <= 2.75 ~ "medium",
log10.mass>2.75 ~ "large"))
## # A tibble: 569 × 25
## taxon common.name class order family genus species primarymethod N
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 lake fishes american e… acti… angu… angui… angu… rostra… telemetry 16
## 2 river fishes blacktail … acti… cypr… catos… moxo… poecil… mark-recaptu… <NA>
## 3 river fishes central st… acti… cypr… cypri… camp… anomal… mark-recaptu… 20
## 4 river fishes rosyside d… acti… cypr… cypri… clin… fundul… mark-recaptu… 26
## 5 river fishes longnose d… acti… cypr… cypri… rhin… catara… mark-recaptu… 17
## 6 river fishes muskellunge acti… esoc… esoci… esox masqui… telemetry 5
## 7 marine fish… pollack acti… gadi… gadid… poll… pollac… telemetry 2
## 8 marine fish… saithe acti… gadi… gadid… poll… virens telemetry 2
## 9 marine fish… lined surg… acti… perc… acant… acan… lineat… direct obser… <NA>
## 10 marine fish… orangespin… acti… perc… acant… naso litura… telemetry 8
## # ℹ 559 more rows
## # ℹ 16 more variables: mean.mass.g <dbl>, log10.mass <dbl>,
## # alternative.mass.reference <chr>, mean.hra.m2 <dbl>, log10.hra <dbl>,
## # hra.reference <chr>, realm <chr>, thermoregulation <chr>, locomotion <chr>,
## # trophic.guild <chr>, dimension <dbl>, preymass <dbl>, log10.preymass <dbl>,
## # PPMR <dbl>, prey.size.reference <chr>, mass_category <chr>
Here we check how the newly created body mass categories compare
across trophic.guild.
homerange %>%
mutate(mass_category=case_when(log10.mass<=1.75 ~ "small",
log10.mass>1.75 & log10.mass <= 2.75 ~ "medium",
log10.mass>2.75 ~ "large")) %>%
ggplot(aes(x=mass_category, fill=trophic.guild))+
geom_bar(position="dodge")
range_category
that breaks down log10.hra into very small, small, medium,
and large classes based on quartile.summary(homerange$log10.hra)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.523 3.653 4.595 4.709 6.016 9.550
library(gtools)
#install.packages("gtools")
quartiles <- quantcut(homerange$log10.hra)
table(quartiles)
## quartiles
## [-1.52,3.65] (3.65,4.59] (4.59,6.02] (6.02,9.55]
## 143 142 142 142
homerange %>%
mutate(range_category=case_when(log10.hra<3.65 ~ "very small",
log10.hra>=3.65 & log10.hra<=4.59 ~ "small",
log10.hra>=4.59 & log10.hra<=6.02 ~ "medium",
log10.hra>=6.02 ~ "large"))
## # A tibble: 569 × 25
## taxon common.name class order family genus species primarymethod N
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 lake fishes american e… acti… angu… angui… angu… rostra… telemetry 16
## 2 river fishes blacktail … acti… cypr… catos… moxo… poecil… mark-recaptu… <NA>
## 3 river fishes central st… acti… cypr… cypri… camp… anomal… mark-recaptu… 20
## 4 river fishes rosyside d… acti… cypr… cypri… clin… fundul… mark-recaptu… 26
## 5 river fishes longnose d… acti… cypr… cypri… rhin… catara… mark-recaptu… 17
## 6 river fishes muskellunge acti… esoc… esoci… esox masqui… telemetry 5
## 7 marine fish… pollack acti… gadi… gadid… poll… pollac… telemetry 2
## 8 marine fish… saithe acti… gadi… gadid… poll… virens telemetry 2
## 9 marine fish… lined surg… acti… perc… acant… acan… lineat… direct obser… <NA>
## 10 marine fish… orangespin… acti… perc… acant… naso litura… telemetry 8
## # ℹ 559 more rows
## # ℹ 16 more variables: mean.mass.g <dbl>, log10.mass <dbl>,
## # alternative.mass.reference <chr>, mean.hra.m2 <dbl>, log10.hra <dbl>,
## # hra.reference <chr>, realm <chr>, thermoregulation <chr>, locomotion <chr>,
## # trophic.guild <chr>, dimension <dbl>, preymass <dbl>, log10.preymass <dbl>,
## # PPMR <dbl>, prey.size.reference <chr>, range_category <chr>
range_category.homerange %>%
mutate(range_category=case_when(log10.hra<3.65 ~ "very small",
log10.hra>=3.65 & log10.hra<=4.59 ~ "small",
log10.hra>=4.59 & log10.hra<=6.02 ~ "medium",
log10.hra>=6.02 ~ "large")) %>%
ggplot(aes(x=range_category, fill=class))+
geom_bar(position = "dodge", alpha=0.6, color = "black")
range_category and plot the range of
log10.mass by taxonomic class.homerange %>%
mutate(range_category=case_when(log10.hra<3.65 ~ "very small",
log10.hra>=3.65 & log10.hra<=4.59 ~ "small",
log10.hra>=4.59 & log10.hra<=6.02 ~ "medium",
log10.hra>=6.02 ~ "large")) %>%
filter(range_category=="small") %>%
ggplot(aes(x=class, y=log10.mass, fill=class))+
geom_boxplot()
ggplot part 3There are many options to change the theme of your plots within ggplot. Have a look [here]https://ggplot2.tidyverse.org/reference/ggtheme.html) for a list of the themes.
Let’s start by building a simple barplot.
p <- homerange %>%
ggplot(aes(x=taxon, fill=trophic.guild))+
geom_bar(na.rm=T, position="dodge")
Have a look at the linedraw theme; I am adding it as
another layer.
p + theme_linedraw()+
theme(axis.text.x = element_text(angle = 60, hjust=1))+
labs(title = "Observations by Taxon in Homerange Data",
x = NULL,
y= "n",
fill= "Trophic Guild")
There are lots of options to manipulate legends. Have a look here.
p+theme_linedraw()+
theme(legend.position = "top",
axis.text.x = element_text(angle = 60, hjust=1))+
labs(title = "Observations by Taxon in Homerange Data",
x = NULL, # removes the label from the x axis
y= "n",
fill= "Trophic Guild")
ggthemesThere are many packages that include additional themes, one of which is ggthemes. Some of these are nice because they are designed to mimic the look of popular publications.
#install.packages("ggthemes")
library(ggthemes)
Here is a list of the ggthemes
#ls("package:ggthemes")[grepl("theme_", ls("package:ggthemes"))]
p +
theme_fivethirtyeight()+
theme(legend.position = "bottom",
axis.text.x = element_text(angle = 60, hjust=1))+
labs(title = "Observations by Taxon in Homerange Data",
x = NULL,
y= "n",
fill= "Trophic Guild")
The default colors used by ggplot are often uninspiring. They don’t make plots pop out in presentations or publications, and you may want to use a customized palette to make things visually consistent.
Access the help for RColorBrewer.
?RColorBrewer
The thing to notice is that there are three different color palettes:
1) sequential, 2) diverging, and 3) qualitative. Within each of these
there are several selections. You can bring up the colors by using
display.brewer.pal(). Specify the number of colors that you
want and the palette name.
display.brewer.pal(5,"YlGnBu") #sequential palette
The R Color Brewer website is very helpful for getting an idea of the color palettes. To make things easy, use these two guidelines:
+scale_colour_brewer() is for points
+scale_fill_brewer() is for fills
Here I chose the Paired palette. Take a moment and
experiment with other options.
p+scale_fill_brewer(palette = "Paired")+
theme(legend.position = "bottom",
axis.text.x = element_text(angle = 60, hjust=1))+
labs(title = "Observations by Taxon in Homerange Data",
x = NULL,
y= "n",
fill= "Trophic Guild")
You can also use paleteer to build a custom palette for
consistency. To access the paleteer collection, I add it to
a new object.
colors <- paletteer::palettes_d_names
Now we can display the palettes. Assign the palette to
my_palette and then build this base R bar plot. There are a
lot of options; paleteer is a collection of popular
palettes. I really like the [ggsci package] (https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html)
my_palette <- paletteer_d("futurevisions::atomic_orange") #Store your palette
barplot(rep(1,6), axes=FALSE, col=my_palette) # Show your palette
Now we just identify my_palette as part of
scale_fill_manual()
p+scale_fill_manual(values=my_palette)+
theme(legend.position = "bottom",
axis.text.x = element_text(angle = 60, hjust=1))+
labs(title = "Observations by Taxon in Homerange Data",
x = NULL,
y= "n",
fill= "Trophic Guild")
Faceting is one of the amazing features of ggplot. It allows us to make multi-panel plots for easy comparison. Here is a boxplot that shows the range of log10.mass by taxon.
homerange %>%
ggplot(aes(x=taxon, y=log10.mass))+
geom_boxplot()+
theme(axis.text.x = element_text(angle = 60, hjust=1))
There are other categorical variables that might be interesting to
overlay. facet_wrap() makes a ribbon of panels by a
specified categorical variable and allows you to control how you want
them arranged.
homerange %>%
ggplot(aes(x=taxon, y=log10.mass))+
geom_boxplot()+
facet_wrap(~trophic.guild)+
theme(axis.text.x = element_text(angle = 60, hjust=1))
facet_grid() allows control over the faceted variable;
it can be arranged in rows or columns. rows~columns.
homerange %>%
ggplot(aes(x=taxon, y=log10.mass))+
geom_boxplot()+
facet_grid(trophic.guild~.)+
theme(axis.text.x = element_text(angle = 60, hjust=1))
facet_grid() will also allow the comparison of two
categorical variables, just remember a~b where a is rows and b is
columns.
homerange %>%
ggplot(aes(x=taxon, y=log10.mass))+
geom_boxplot()+
facet_grid(trophic.guild~thermoregulation)+
theme(axis.text.x = element_text(angle = 60, hjust=1))